FIGURE 4.2
(a) A cell containing four intermediate nodes B1, B2, B3, B4 that apply sampled operations to the input node B0, which comes from the output of the previous cell. The output node concatenates the outputs of the four intermediate nodes. (b) A Gabor filter. (c) A generic denoising block. Following [253], it wraps the denoising operation with a 1 × 1 convolution and an identity skip connection [84].
we progressively abandon the worst-performing operation and, for each edge, sample the operations that have a low expected performance but a significant variance. Unlike [291], which uses performance as the evaluation metric to decide which operation should be pruned, we use the anti-bandit algorithm described in Section 4.2.1 to make this decision.
Following UCB in the bandit algorithm, we first obtain an initial performance for each operation on every edge. Specifically, we sample one of the $K$ operations in $\Omega^{(i,j)}$ for every edge, obtain the validation accuracy $a$ by adversarially training the sampled network for one epoch, and finally assign this accuracy as the initial performance $m_{k,0}^{(i,j)}$ of all the sampled operations.
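As a rough illustration of this warm-up step, consider the NumPy sketch below. The edge count `E`, the operation count `K`, and the `train_and_validate` placeholder are hypothetical stand-ins for the actual supernet and its adversarial training loop, and visiting each operation once per edge is just one plausible reading of the initialization, not the confirmed procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
E, K = 14, 8          # hypothetical: number of edges and candidate operations

# m[e, k]: running performance of the k-th operation on edge e (Eq. 4.11)
# n[e, k]: how many times that operation has been sampled
m = np.zeros((E, K))
n = np.zeros((E, K))

def train_and_validate(sampled_ops):
    """Placeholder for one epoch of adversarial training of the sampled
    subnetwork followed by evaluation; returns the validation accuracy a."""
    return rng.uniform(0.1, 0.9)

# Warm-up: visit each of the K operations once on every edge (one plausible
# reading of the initialization), assigning the validation accuracy a of the
# sampled network to all operations sampled in that round.
for k in range(K):
    sampled = np.full(E, k)            # sample operation k on every edge
    a = train_and_validate(sampled)
    m[np.arange(E), sampled] = a
    n[np.arange(E), sampled] += 1
```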
By considering the confidence of the $k$th operation using Eq. 4.8, the LCB is calculated by
$$ s_L(o_k^{(i,j)}) = m_{k,t}^{(i,j)} - \sqrt{\frac{2\log N}{n_{k,t}^{(i,j)}}}, \qquad (4.9) $$
where $N$ is the total number of samples, $n_{k,t}^{(i,j)}$ denotes the number of times the $k$th operation of edge $(i, j)$ has been selected, and $t$ is the epoch index. The first term in Eq. 4.9 is the value term (see Eq. 4.2), which favors operations that look good historically; the second is the exploration term (see Eq. 4.3), which gives operations an exploration bonus that grows with $\log N$. The selection probability for each operation is defined as
$$ p(o_k^{(i,j)}) = \frac{\exp\{-s_L(o_k^{(i,j)})\}}{\sum_m \exp\{-s_L(o_m^{(i,j)})\}}. \qquad (4.10) $$
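Eqs. 4.9 and 4.10 can be sketched as follows, continuing the NumPy setup above; the `eps` guard and the stability shift inside the softmax are implementation details assumed here, not specified in the text:

```python
def lcb_scores(m, n, eps=1e-8):
    """Lower confidence bound of Eq. 4.9: value term minus exploration term.
    eps guards against division by zero for never-sampled operations
    (an assumed implementation detail)."""
    N = n.sum()                                  # total number of samples
    return m - np.sqrt(2.0 * np.log(N) / (n + eps))

def sampling_probs(s_l):
    """Row-wise softmax over negative LCB scores (Eq. 4.10): operations with
    a smaller confidence bound get a larger sampling probability."""
    z = np.exp(-(s_l - s_l.min(axis=1, keepdims=True)))  # shift for stability
    return z / z.sum(axis=1, keepdims=True)
```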
The minus sign in Eq. 4.10 means that we prefer to sample operations with a smaller confidence. After sampling one operation for every edge based on $p(o_k^{(i,j)})$, we obtain the validation accuracy $a$ by adversarially training the sampled network for one epoch, and then update the performance $m_{k,t}^{(i,j)}$, which historically indicates the validation accuracy of all the sampled operations $o_k^{(i,j)}$, as
$$ m_{k,t}^{(i,j)} = (1-\lambda)\, m_{k,t-1}^{(i,j)} + \lambda \cdot a, \qquad (4.11) $$
where $\lambda$ is a hyperparameter.
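Putting Eqs. 4.9-4.11 together, one search epoch might look like the following sketch, reusing the arrays and helpers from the snippets above; the value of λ and the number of epochs are assumed for illustration only:

```python
lam = 0.7   # assumed value for the smoothing hyperparameter lambda

for t in range(1, 50):                        # search epochs
    p = sampling_probs(lcb_scores(m, n))
    # Sample one operation per edge according to Eq. 4.10.
    sampled = np.array([rng.choice(K, p=p[e]) for e in range(E)])
    a = train_and_validate(sampled)           # one epoch, adversarial training
    idx = (np.arange(E), sampled)
    m[idx] = (1.0 - lam) * m[idx] + lam * a   # Eq. 4.11: moving-average update
    n[idx] += 1
    # The progressive abandoning of the worst-performing operation on each
    # edge (via the anti-bandit criterion of Section 4.2.1) is omitted here.
```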